With recent advances in NLP, ranking search results is increasingly an NLP problem. This post gives an insight into the basics as well as some of the most recent advances.
Throughout this post, we will discuss the problem through the lens of an e-commerce website, building a search engine that lets users search for products.
As the figure shows, a user has a rough idea of the products they are looking for. This problem can be modeled in the following way.
Figure 1: Introduction
The dataset for this type of problem is typically collected by mining user-click logs: a product that the user clicks is considered a positive and should be ranked higher. Other methods use human annotators to rate the relevance of a set of documents given a query.
Before we dive into how to solve the problem, it is important to establish the criteria to evaluate it. In this case, the relevance of the ranked documents to the user is the primary criterion. There are multiple ways of measuring it; the usual precision, recall, and F-score are some. For IR specifically, Discounted Cumulative Gain (DCG), its normalized form NDCG, and Mean Reciprocal Rank (MRR) are used. We will discuss them further here.
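These metrics are straightforward to compute by hand. A minimal sketch (the function names are mine, not from a library; DCG here uses the common \(2^{rel}-1\) gain with a \(\log_2\) position discount):

```python
import math

def dcg(relevances):
    # Graded relevance gains, discounted by log2 of the rank position (1-indexed)
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mean_reciprocal_rank(ranked_lists):
    # Reciprocal rank of the first relevant (rel > 0) document, averaged over queries
    total = 0.0
    for ranking in ranked_lists:
        for i, rel in enumerate(ranking):
            if rel > 0:
                total += 1.0 / (i + 1)
                break
    return total / len(ranked_lists)
```

An ideal ordering scores NDCG of 1.0; any misordering scores strictly less, which is what makes NDCG a useful training target later on.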
The basis of ranking is rooted in three basic approaches to loss calculation: pointwise, pairwise, and listwise ranking.
Formalizing the problem, we have a set of queries \(Q=\{q_1, q_2, \dots, q_n\}\), for which we need to retrieve a set of relevant documents \(D=\{d_1, d_2, \dots, d_k\}\). Each of the approaches below treats this problem differently.
In the pointwise setting, each document is treated independently with respect to the query: given a query, every document is assigned a relevance score on its own.
In the pairwise setting, two documents are taken together and their relative precedence is predicted.
The above two approaches do not take the inter-document relationships into account; it is assumed that all the documents are independent.
Listwise approaches score the entire list of documents at once, and are therefore the closest to directly modeling the metrics we defined earlier.
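The three approaches can be contrasted by the shape of their loss functions. A minimal sketch (these are illustrative toy losses, not from any particular paper; the listwise one is in the style of ListNet's softmax cross-entropy):

```python
import math

def pointwise_loss(score, label):
    # Pointwise: regress each document's score to its relevance label (MSE)
    return (score - label) ** 2

def pairwise_loss(score_pos, score_neg):
    # Pairwise: logistic loss on the score difference of a (relevant, irrelevant) pair
    return math.log(1 + math.exp(-(score_pos - score_neg)))

def listwise_loss(scores, labels):
    # Listwise (ListNet-style): cross-entropy between the softmax of the
    # predicted scores and the softmax of the relevance labels
    def softmax(xs):
        exps = [math.exp(x) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]
    p_true = softmax(labels)
    p_pred = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))
```

Note that only the listwise loss sees all the documents for a query at once; the other two see one document or one pair at a time.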
Query-dependent methods typically use TF-IDF [cite this]; another popular lexical-retrieval approach is BM25¹ [cite this].
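BM25 scores a document by summing, over the query terms, a smoothed IDF weight times a saturated term frequency normalized by document length. A minimal sketch over pre-tokenized documents (the function name and corpus layout are my own; `k1` and `b` are BM25's standard free parameters):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # corpus: list of tokenized documents; doc_terms: the document being scored
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        # Smoothed IDF: rare terms get a larger weight
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        freq = tf[term]
        # Term-frequency saturation, normalized by document length vs. average
        denom = freq + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * freq * (k1 + 1) / denom
    return score
```

A document containing the query term always outscores one that does not, and additional occurrences of the term have diminishing returns thanks to the `k1` saturation.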
Here we discuss some of the seminal works.
These are pairwise approaches. In RankNet, the query \(q_i\) and document \(d_k\) features are concatenated and fed through the model. The outputs for a pair of query-document inputs, \(f(q_i, d_1)\) and \(f(q_i, d_2)\), are compared: the network gives a relevance score to each document, and the document with the higher score is given precedence. Once this is done for all pairs of documents, they are sorted in order of relevance. The model is trained with a cross-entropy loss. A drawback of RankNet is that it optimizes only over document pairs, whereas our evaluation metric, DCG, evaluates the relative ranking of all the documents: the penalty for incorrectly ranking the first two documents should be higher than that for the last two.
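RankNet's cross-entropy loss is applied to the probability that \(d_1\) should rank above \(d_2\), modeled as a sigmoid of the score difference. A minimal sketch (function name mine):

```python
import math

def ranknet_loss(s_i, s_j, label):
    # label = 1 if document i should rank above document j, 0 otherwise.
    # P(i > j) is a sigmoid of the score difference.
    p = 1.0 / (1.0 + math.exp(-(s_i - s_j)))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))
```

The loss is low when the scores already agree with the label and grows as the score difference points the wrong way.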
LambdaRank is an extension of RankNet that mitigates this mismatch with the DCG metric: the gradient for each document pair is scaled by the change in NDCG, \(|\Delta NDCG|\), that would result from swapping the two documents.
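This swap-based weighting is easy to sketch: compute NDCG before and after swapping the two documents, and multiply the RankNet gradient for the pair by the absolute difference. A minimal illustration (function names mine, using the standard \(2^{rel}-1\) DCG gain):

```python
import math

def dcg(relevances):
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances))

def delta_ndcg(relevances, i, j):
    # |Change in NDCG| if the documents at positions i and j are swapped
    ideal = dcg(sorted(relevances, reverse=True))
    swapped = relevances[:]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(relevances) - dcg(swapped)) / ideal

def lambda_weight(s_i, s_j, relevances, i, j):
    # RankNet gradient for the pair (i ranked above j), scaled by |delta NDCG|
    ranknet_grad = -1.0 / (1.0 + math.exp(s_i - s_j))
    return ranknet_grad * delta_ndcg(relevances, i, j)
```

Because the position discount is steepest at the top of the list, a swap near the top produces a larger \(|\Delta NDCG|\), so mistakes on the first few documents are penalized more, exactly the property RankNet lacked.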
LambdaMART combines these lambda gradients with gradient-boosted trees. Because the lambdas account for the entire set of documents, it is effectively a listwise ranker. The trees branch on document features, and the results are aggregated over all the trees to give a final relevance prediction for each document.
The progression has been from RankNet to LambdaRank and then to LambdaMART.
In the case of RankNet, a relevance score is computed for each query-document pair and the documents are ranked accordingly. With DSSM, in contrast, the embeddings for the bag-of-words vectors of the query and the documents are computed independently, but by the same model. The cosine similarity between the query embedding and each document embedding is then calculated as the semantic relevance score, and the documents are sorted based on this score.
Figure 2: DSSM Overview
To train the network, the semantic relevance score between a query and a document is turned into a posterior probability of the document given the query through a softmax function. Random negative samples, i.e., completely irrelevant documents, are included in each minibatch alongside a relevant document, and the model is trained with a cross-entropy loss.
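The resulting objective is a softmax over cosine similarities, with the loss being the negative log-probability of the relevant document. A minimal sketch on precomputed embeddings (function names mine; `gamma` is DSSM's softmax smoothing factor):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dssm_loss(q_emb, pos_emb, neg_embs, gamma=10.0):
    # Softmax over smoothed cosine similarities of the positive document and
    # the sampled negatives; loss is -log P(positive | query)
    sims = [cosine(q_emb, pos_emb)] + [cosine(q_emb, n) for n in neg_embs]
    exps = [math.exp(gamma * s) for s in sims]
    return -math.log(exps[0] / sum(exps))
```

The loss is small when the positive document's embedding is much closer to the query than any of the negatives, which is what pushes relevant query-document pairs together in the embedding space.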
DSSM also uses word hashing to tackle the large dimensionality introduced when working with a web-scale vocabulary, followed by MLP layers as shown in the figure.
Figure 3: DSSM Architecture
TODO
DeText is a flexible pipeline that was built to rank LinkedIn’s user profiles, jobs and help FAQs based on search queries.
Figure 4: DeText Architecture
Figure 5: DeText Inference Pipeline
Amazon's search ranking incorporates a Graph Neural Network in tandem with a transformer encoder. The transformer is trained on both the queries and the documents.
They train a DistilBERT with 6 layers and 768 hidden units.
Figure 6: Architecture
During inference, similarly to DSSM, they compute the cosine similarity between the query embedding and the document (product) embeddings, then find the K nearest neighbors of the query. (check this)
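The retrieval step amounts to a nearest-neighbor search by cosine similarity. A minimal brute-force sketch over precomputed embeddings (function names mine; production systems would use an approximate-nearest-neighbor index rather than a full scan):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_products(query_emb, product_embs, k=2):
    # Rank every product by cosine similarity to the query and keep the top k
    ranked = sorted(range(len(product_embs)),
                    key=lambda i: cosine(query_emb, product_embs[i]),
                    reverse=True)
    return ranked[:k]
```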
25, because it was the 25th formula that had worked for them.↩︎